1 Introduction

Lung cancer accounts for more deaths than any other cancer in both men and women, about 28% of all cancer deaths. The prognosis for lung cancer is poor. The Cancer Genome Atlas (TCGA) has samples of the most common type of lung cancer called non-small cell lung cancers. Specifically, the subtypes being studied are called lung adenocarcinoma and lung squamous cell carcinoma. Here we analyze the expression profiles of those patients, accessible in the form of a raw RNA-seq counts using a pipeline based on the R/Bioconductor software package Rsubread.

2 Import Data

We import the raw data from the SummarizedExperiment. It contains only one assay, in a matrix-like object with the genes in the row and the samples in the columns. The data is the counts for gene and sample.

library(SummarizedExperiment)

se <- readRDS(file.path("rawCounts", "seLUSC.rds"))
se
class: RangedSummarizedExperiment 
dim: 20115 553 
metadata(5): experimentData annotation cancerTypeCode
  cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(553): TCGA.18.3406.01A.01R.0980.07
  TCGA.18.3407.01A.01R.0980.07 ... TCGA.90.7767.11A.01R.2125.07
  TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
  lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total

The raw table of counts contains RNA-seq data from 20115 genes and 553 samples.

3 Extract paired samples

We want to do a paired experiment. For this reason, from the 553 samples, we will keep those that share patient identification, which is in the factor bcr_patient_barcode.

We select those patient identification that are in normal tissue and in tumor tissue.

normal <- data.frame(colData(se)[colData(se)$type == 'normal',])$bcr_patient_barcode
tumor <- data.frame(colData(se)[colData(se)$type == 'tumor',])$bcr_patient_barcode
length(normal)
[1] 51
length(tumor)
[1] 502

We have 51 samples with normal tissue and 502 samples with tumoral tissue. Let’s see which of the normal samples share patient identification with tumor samples.

common_bcr_patient_barcode <- normal[normal %in% tumor]
length(common_bcr_patient_barcode)
[1] 47

47 patient identification are in normal samples and tumor samples. This means that for those patients, there were extraction for normal tissue and for tumoral tissue.

Now, we filter the SummarizedExperiment object to keep only these patients.

paired_seLUSC <- se[,colData(se)$bcr_patient_barcode %in% common_bcr_patient_barcode]
paired_seLUSC
class: RangedSummarizedExperiment 
dim: 20115 94 
metadata(5): experimentData annotation cancerTypeCode
  cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(94): TCGA.22.4593.01A.21R.1820.07
  TCGA.22.4609.01A.21R.2125.07 ... TCGA.90.7767.11A.01R.2125.07
  TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
  lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total

Now, there are 20115 genes, the same as before because we have not filtered genes, and 94 samples, which are 47 normal samples and 47 tumoral samples, with the same patient identificacion.

length(unique(paired_seLUSC$bcr_patient_barcode))
[1] 47
length(levels(paired_seLUSC$bcr_patient_barcode))
[1] 495

Analyzing the length of the array that contains the patient identification and its levels, we see that, although we have discarded the patients that did not pass the filter of the paired design, the levels are still with all the patient identifications, 495.

When doing differential expression analysis, the levels are taken to build the design matrix. For this reason, as the leves are still the one in the raw data, we have to take out the levels that are not in the array of patient identifications.

paired_seLUSC$bcr_patient_barcode <- droplevels(paired_seLUSC$bcr_patient_barcode)
length(unique(paired_seLUSC$bcr_patient_barcode))
[1] 47
length(levels(paired_seLUSC$bcr_patient_barcode))
[1] 47

Now we have 47 patient identifications and also 47 levels. Now we can save this filtered SummarizedExperiment in an RDS file.

saveRDS(paired_seLUSC, file.path("results", "paired_seLUSC.rds"))

4 Data import

We start importing the table of counts from the SummarizedExperiment container that is already paired for the previous filtering process. The structure is the same: the rows represent genes and the columns represent samples.

library(SummarizedExperiment)

paired_se <- readRDS(file.path("results", "paired_seLUSC.rds"))
paired_se
class: RangedSummarizedExperiment 
dim: 20115 94 
metadata(5): experimentData annotation cancerTypeCode
  cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(94): TCGA.22.4593.01A.21R.1820.07
  TCGA.22.4609.01A.21R.2125.07 ... TCGA.90.7767.11A.01R.2125.07
  TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
  lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total

The row table of counts contains RNA-seq data from 20115 genes and 94 samples, which come from 47 patients.

We explore the column, that corresponds to samples. It contains the phenotypic data, which in this case corresponds to clinical variables, and their corresponding metadata.

dim(colData(paired_se))
[1]  94 549
colData(paired_se)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
                                 type                     bcr_patient_uuid
                             <factor>                             <factor>
TCGA.22.4593.01A.21R.1820.07    tumor 8fbe1f9e-f2a6-4550-aabf-b06607b821f0
TCGA.22.4609.01A.21R.2125.07    tumor 47cbb242-a356-43ba-95c3-0b0c9ef90fbf
TCGA.22.5471.01A.01R.1635.07    tumor bece6b8e-5d6c-4dd6-85a3-9b3a9c670aa7
TCGA.22.5472.01A.01R.1635.07    tumor bd3bf142-7c14-4538-8a76-3c6e140fa01a
TCGA.22.5478.01A.01R.1635.07    tumor 4daf4a91-bc36-40c8-8fca-ea61b6706775
                             bcr_patient_barcode form_completion_date
                                        <factor>             <factor>
TCGA.22.4593.01A.21R.1820.07        TCGA-22-4593             2011-8-5
TCGA.22.4609.01A.21R.2125.07        TCGA-22-4609            2012-1-16
TCGA.22.5471.01A.01R.1635.07        TCGA-22-5471             2011-5-5
TCGA.22.5472.01A.01R.1635.07        TCGA-22-5472            2011-4-29
TCGA.22.5478.01A.01R.1635.07        TCGA-22-5478             2011-5-5
                             prospective_collection
                                           <factor>
TCGA.22.4593.01A.21R.1820.07                     NO
TCGA.22.4609.01A.21R.2125.07                     NO
TCGA.22.5471.01A.01R.1635.07                     NO
TCGA.22.5472.01A.01R.1635.07                     NO
TCGA.22.5478.01A.01R.1635.07                     NO
mcols(colData(paired_se), use.names=TRUE)
DataFrame with 549 rows and 2 columns
                                                         labelDescription
                                                              <character>
type                                           sample type (tumor/normal)
bcr_patient_uuid                                         bcr patient uuid
bcr_patient_barcode                                   bcr patient barcode
form_completion_date                                 form completion date
prospective_collection            tissue prospective collection indicator
...                                                                   ...
lymph_nodes_pelvic_pos_total                               total pelv lnp
lymph_nodes_aortic_examined_count                           total aor lnr
lymph_nodes_aortic_pos_by_he                          aln pos light micro
lymph_nodes_aortic_pos_by_ihc                                 aln pos ihc
lymph_nodes_aortic_pos_total                                total aor-lnp
                                        CDEID
                                  <character>
type                                       NA
bcr_patient_uuid                           NA
bcr_patient_barcode                   2673794
form_completion_date                       NA
prospective_collection                3088492
...                                       ...
lymph_nodes_pelvic_pos_total          3151828
lymph_nodes_aortic_examined_count     3104460
lymph_nodes_aortic_pos_by_he          3151832
lymph_nodes_aortic_pos_by_ihc         3151831
lymph_nodes_aortic_pos_total          3151827

These metadata consists of two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the Common Data Element (CDE) identifier. This identifier can be useed to search for further information about the associated clinical variable.

Now, we explore the row (feature) data.

rowData(paired_se)
DataFrame with 20115 rows and 3 columns
           symbol     txlen      txgc
      <character> <integer> <numeric>
1            A1BG      3322 0.5644190
2             A2M      4844 0.4882329
3            NAT1      2280 0.3942982
4            NAT2      1322 0.3895613
5        SERPINA3      3067 0.5249429
...           ...       ...       ...
20111       POTEB      1706 0.4308324
20112    SNORD124       104 0.4903846
20113   SNORD121B        81 0.4074074
20114      GAGE10       538 0.5055762
20115   BRWD1-IT2      1028 0.5924125
rowRanges(paired_se)
GRanges object with 20115 ranges and 3 metadata columns:
            seqnames               ranges strand |      symbol     txlen
               <Rle>            <IRanges>  <Rle> | <character> <integer>
          1    chr19 [58345178, 58362751]      - |        A1BG      3322
          2    chr12 [ 9067664,  9116229]      - |         A2M      4844
          9     chr8 [18170477, 18223689]      + |        NAT1      2280
         10     chr8 [18391245, 18401218]      + |        NAT2      1322
         12    chr14 [94592058, 94624646]      + |    SERPINA3      3067
        ...      ...                  ...    ... .         ...       ...
  100996331    chr15 [20835372, 21877298]      - |       POTEB      1706
  101340251    chr17 [40027542, 40027645]      - |    SNORD124       104
  101340252     chr9 [33934296, 33934376]      - |   SNORD121B        81
  102724473     chrX [49303669, 49319844]      + |      GAGE10       538
  103091865    chr21 [39313935, 39314962]      + |   BRWD1-IT2      1028
                 txgc
            <numeric>
          1 0.5644190
          2 0.4882329
          9 0.3942982
         10 0.3895613
         12 0.5249429
        ...       ...
  100996331 0.4308324
  101340251 0.4903846
  101340252 0.4074074
  102724473 0.5055762
  103091865 0.5924125
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome

We are going to compare normal samples with tumor samples. Exploring the metadata, we can see that we have 47 normal samples and 47 tumor samples. As they are paired, the number must be the same.

table(paired_se$type)

normal  tumor 
    47     47 

5 Quality assessment

Normalitzation consists in the adjustment for sample and gene specific factors, to make gene expression values comparable across samples. This process is really important due to the fact that the samples may have been sequenced at different depth and that there may be sample specific biases related to technical differences in samples extarction, preparation and sequencing.

The normalitzation is done in two steps: * Within-sample: adjustments to compare across features in a sample. Scaling: using counts per million reads (CPM) mapped to the genome. Between-sample: adjustments to compare a feature across samples. Sample-specific normalization factors: using the TMM algorithm from the R/Bioconductor package edgeR. Quantile normalization: using the CQN algorithm from the R/Bioconductor package cqn or in the limma-voom pipeline.

To perform quality assessment and normalization we need first to load the edgeR R/Bioconductor package and create a DGEList object.

library(edgeR)

paired_dge <- DGEList(counts=assays(paired_se)$counts, genes=mcols(paired_se))
Warning in as.data.frame(x, row.names = NULL, optional = optional, ...):
Arguments in '...' ignored
names(paired_dge)
[1] "counts"  "samples" "genes"  
saveRDS(paired_dge, file.path("results", "paired_dge.rds"))

Once the ‘DGEList’ object is created, we can performe the scaling to CPM values. Therefore, \(\log_2\) CPM values of expression are calculated and used as an additional assay element to ease their manipulation. \(\log_2\) CPM units separate better high and low expression, than raw counts or non-logged CPM units.

assays(paired_se)$logCPM <- cpm(paired_dge, log=TRUE, prior.count=0.5)
assays(paired_se)$logCPM[1:5, 1:5]
   TCGA.22.4593.01A.21R.1820.07 TCGA.22.4609.01A.21R.2125.07
1                      2.433074                     1.760274
2                      9.013463                    10.810625
9                     -6.857108                    -6.857108
10                    -6.857108                    -6.857108
12                     5.605901                     5.647349
   TCGA.22.5471.01A.01R.1635.07 TCGA.22.5472.01A.01R.1635.07
1                      2.122585                    0.7750831
2                      7.713748                   11.1007352
9                     -6.857108                   -6.8571079
10                    -6.857108                   -6.8571079
12                     4.823934                    5.8569327
   TCGA.22.5478.01A.01R.1635.07
1                      3.713818
2                      9.448173
9                     -6.857108
10                    -6.857108
12                     4.920808

5.1 Sequencing depth

Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample, also known as library sizes, in increasing order.

Library sizes in increasing order.

Figure 1: Library sizes in increasing order

There is the same number of tumor samples and normal samples and they seem to be randonly distributed. However, there is still high variability in the library size as can be observed in Figure 1.

5.2 Distribution of expression levels among samples

Let’s look at the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately, and are shown in Figure 2

Non-parametric density distribution of expression profiles per sample.

Figure 2: Non-parametric density distribution of expression profiles per sample

In Figure 2 we do not appreciate substantial differences between the samples in the distribution of expression values as.

5.3 Distribution of expression levels among genes

Let’s calculate now the average expression per gene through all the samples. Figure 3 shows the distribution of those values across genes.

Distribution of average expression level per gene.

Figure 3: Distribution of average expression level per gene

RNA-seq expression profiles from lowly-expressed genes can lead to artifacts in downstream differential expression analyses. For this reason, it is common practice to remove them following some criteria, such as: filter out genes below a minimum average CPM (or log2 CPM) value throughout the samples or filter out genes with fewer than a given number of samples meeting a minimum CPM (or log2 CPM) cutoff. This graphic shows what would be the minimum average CPM (red line).

We can also make an MA-plot to see biases due to expression. First, we define the groups that we want to compare. In our case, we use the sample type to define groups, by modifying the DGEList object as follows:

paired_dge$samples$group <- paired_se$type
table(paired_dge$samples$group)

normal  tumor 
    47     47 

Here we generate the MA-plot using type sample as grouping factor.

MA-plot using type grouping.

Figure 4: MA-plot using type grouping

Figure 4 shows us the need of remove the lower experssed genes to normalize the samples.

5.4 Filtering of lowly-expressed genes

In the light of this plot, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes.

mask <- avgexp > 1
dim(paired_se)
[1] 20115    94
paired_se.filt <- paired_se[mask, ]
dim(paired_se.filt)
[1] 11866    94
paired_dge.filt <- paired_dge[mask, ]
dim(paired_dge.filt)
[1] 11866    94

After this filtering process, we end up with 11866 genes.

Store un-normalized versions of the filtered expression data.

saveRDS(paired_se.filt, file.path("results", "paired_se.filt.unnorm.rds"))
saveRDS(paired_dge.filt, file.path("results", "paired_dge.filt.unnorm.rds"))

We can also use a second approach to filter data using the CPM cutoff value of expression. We will keep only genes that have this minimum level of expression in at least as many samples as the smallest group of comparison. We are still comparing sample type.

cpmcutoff <- round(10/min(paired_dge$sample$lib.size/1e+06), digits = 1)
cpmcutoff
[1] 0.3
nsamplescutoff <- min(table(paired_se$type))
nsamplescutoff
[1] 47

After knowing these parameters we can proceed to mask the lower-expressed genes.

mask2 <- rowSums(cpm(paired_dge) > cpmcutoff) >= nsamplescutoff
paired_se.filt2 <- paired_se[mask2, ]
paired_dge.filt2 <- paired_dge[mask2, ]
dim(paired_se)
[1] 20115    94
dim(paired_se.filt)
[1] 11866    94
dim(paired_se.filt2)
[1] 14200    94

After this second kind of filtering, we end up with 14200. As we can see, the first filter is more stringent thant the second one.

Store un-normalized versions of the filtered expression data.

saveRDS(paired_se.filt2, file.path("results", "paired_se.filt2.unnorm.rds"))
saveRDS(paired_dge.filt2, file.path("results", "paired_dge.filt2.unnorm.rds"))

We will compare both approaches done before in order to compare them and see which one discards more genes and is more restrictive.

Distribution of average expression level per gene and filtering comparative.

Figure 5: Distribution of average expression level per gene and filtering comparative

After comparing the different approaches used in the filter of low expressed genes, we have decided to contunue working with the first set of filtered genes because this is more restrictive. We can see in Figure 5 that the second approach is more conservative with our dataset and even genes with negative logCPM passes through this filter.

5.5 Normalization

We calculate now the normalization factors on the filtered expression data set.

paired_dge.filt <- calcNormFactors(paired_dge.filt)

Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.

assays(paired_se.filt)$logCPM <- cpm(paired_dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)

Store normalized versions of the filtered expression data.

saveRDS(paired_se.filt, file.path("results", "paired_se.filt.rds"))
saveRDS(paired_dge.filt, file.path("results", "paired_dge.filt.rds"))

5.6 MA-plots

We examine now the MA-plots of the normalized expression profiles. We look first to the tumor samples in Figure 7.

First, we define the groups that we want to compare. In our case, we use the sample type to define groups, by modifying the DGEList object as follows:

paired_dge$samples$group <- paired_se$type
table(paired_dge$samples$group)

normal  tumor 
    47     47 
paired_dge.filt$samples$group <- paired_se$type

Here we generate the MA-plot using type sample as grouping factor.

MA-plot using type grouping after filtering by low expression.

Figure 6: MA-plot using type grouping after filtering by low expression

In Figure 6 we don’t see the artifact caused by the discreteness of counts at low values, where ratios between low numbers may easy lead to large fold-changes. In general, fold-changes from large expression values are more reliable than those coming from low-expression values. We don’t see it because we have filtered by low-expression previously. Red line, which indicates us the tendency of the dots, shows that there is no bias, since it falls in the blue line. At the end it seems to decrease a little, but as we said, the fold-changes with a high average logCPM is more reliable. The problem would be if the biased was at the beginning.

MA-plots of the tumor samples.

Figure 7: MA-plots of the tumor samples

We do not observe samples with major expression-level dependent biases. Although some individual samples may have little bias, in general this is corrected. Let’s look now to the normal samples in Figure 8.

MA-plots of the normal samples.

Figure 8: MA-plots of the normal samples

MA-plots of the normal samples.

Figure 8: MA-plots of the normal samples

We do not observe neither important expression-level dependent biases among the normal samples in Figure 8. There is even less bias than in tumor samples.

5.7 Batch identification

We will search now for potential surrogate of batch effect indicators. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples.

tss <- substr(colnames(paired_se.filt), 6, 7)
table(tss)
tss
22 33 34 39 43 51 56 58 60 77 85 90 92 
20  4  4  2 16  6 16  2  2 14  2  4  2 
center <- substr(colnames(paired_se.filt), 27, 28)
table(center)
center
07 
94 
plate <- substr(colnames(paired_se.filt), 22, 25)
table(plate)
plate
0980 1100 1635 1758 1820 1858 1949 2045 2125 2187 2296 2326 A28V 
   1    3   11    4   14    1    8   20   20    4    4    2    2 
portionanalyte <- substr(colnames(paired_se.filt), 18, 20)
table(portionanalyte)
portionanalyte
01R 11R 21R 31R 41R 
 55  27   8   2   2 
samplevial <- substr(colnames(paired_se.filt), 14, 16)
table(samplevial)
samplevial
01A 11A 
 47  47 

From this information we can make the following observations:

  • All samples were sequenced at the same center

  • Samples belong to two vials.

  • Samples were collected across different tissue source sites (TSS).

  • Samples were sequenced within differnt plates. There are 13 different plates.

  • Samples were sequenced using different portions and analyte combinations. There are 5 different conditions described.

We are going to use the TSS as surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with TSS.

table(data.frame(TYPE=paired_se.filt$type, TSS=tss))
        TSS
TYPE     22 33 34 39 43 51 56 58 60 77 85 90 92
  normal 10  2  2  1  8  3  8  1  1  7  1  2  1
  tumor  10  2  2  1  8  3  8  1  1  7  1  2  1

We observe that we have the same TSS in normal and tumor samples, so the TSS is not a source of variability, since they are all the same.

We can also examine the other three parameters that can lead to variability due to technical issues. We are not interested in this variability and can be a source of confussion.

table(data.frame(TYPE=paired_se.filt$type, PORTIONALYTE=portionanalyte))
        PORTIONALYTE
TYPE     01R 11R 21R 31R 41R
  normal  44   3   0   0   0
  tumor   11  24   8   2   2
table(data.frame(TYPE=paired_se.filt$type, PLATE=plate))
        PLATE
TYPE     0980 1100 1635 1758 1820 1858 1949 2045 2125 2187 2296 2326 A28V
  normal    0    0    5    4    7    1    4   10   10    2    2    1    1
  tumor     1    3    6    0    7    0    4   10   10    2    2    1    1
table(data.frame(TYPE=paired_se.filt$type, SAMPLEVIAL=samplevial))
        SAMPLEVIAL
TYPE     01A 11A
  normal   0  47
  tumor   47   0
  • All normal samples belong to the same vial and all tumor samples belong to the same vial. As it is remarked in the documentation, this is the correct sample vial for tumoral tissue and for normal tissue. So it does not provoque any variability. Moreover, they are all from the same aliquot (A).

  • The plate is almost equal between normal and tumor.

For these reasons, the variables that may lead into batch effects are not leading to it, as commented before.

To ilustrate how samples are grouped together by hierarchical clustering and multidimensional scaling, we draw the Dendrogram plot (Figure 9) and the MDS plot (Figure 10). We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts.

Hierarchical clustering of the samples.

Figure 9: Hierarchical clustering of the samples

We can observe that samples cluster primarily by sample type: tumor or normal.

In Figure 10 we show the corresponding MDS plot. Here we see more clearly that tumor and normal samples are separated. We can also observe that one tumor samples, corresponding to individual 8623 is in the normal cluster and also happens in the MDS plot. A closer examination of their corresponding MA-plots also reveals a slight dependence of expression changes on average expression. We may consider discarding this sample and doing the MDS plot again to have a closer look to the differences among the rest of the samples.

Multidimensional scaling plot of the samples.

Figure 10: Multidimensional scaling plot of the samples

For the former case, assume that from the previous MDS plot we could decide discard that mentioned sample, which is identified with the number 8623. But we will not do it because we have problems with the matrix size after removing this sample.

As we have not eliminate batch effects, because we have decided that there is no significant since the samples are paired and most of the technical variables are compensated or the same, we will keep working with the SummarizedExperiment object from filtered genes.

Here is seen how the sample disapears after it is removed, then all the normal samples and tumor samples cluster together with any exception (Figure 11 and Figure 12.

maskbad <- colnames(paired_se.filt) %in% colnames(paired_se.filt)[substr(colnames(paired_se.filt), 9, 12) == "8623"]
dim(paired_se.filt)
[1] 11866    94
dim(paired_dge.filt)
[1] 11866    94
paired_se.filt <- paired_se.filt[, !maskbad]
paired_dge.filt <- paired_dge.filt[, !maskbad]
dim(paired_se.filt)
[1] 11866    92
dim(paired_dge.filt)
[1] 11866    92
Hierarchical clustering of the samples.

Figure 11: Hierarchical clustering of the samples

Multidimensional scaling plot of the samples.

Figure 12: Multidimensional scaling plot of the samples

saveRDS(paired_se.filt, file.path("results", "paired_se.filt2.rds"))
saveRDS(paired_dge.filt, file.path("results", "paired_dge.filt2.rds"))

6 Surrogate Variable Analysis

We perform differentual expression analysis using limma pipeline, combined with the surrogate variable analysis using the R/Bioconductor package sva.

First, we do the SVA to adjust for surrogated variables. In the design matrix is also included the patient identifications.

library(sva)

mod <- model.matrix(~type + bcr_patient_barcode, data = colData(se.filt))
mod0 <- model.matrix(~1, data = colData(se.filt))


sv <- sva(assays(se.filt)$logCPM, mod = mod, mod0 = mod0)
Number of significant surrogate variables is:  11 
Iteration (out of 5 ):1  2  3  4  5  
sv$n
[1] 11

There are 11 surrogate variables. We combine the model matrix with the surrogate variables.

mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:(nlevels(se.filt$bcr_patient_barcode)+1)], paste0("SV", 1:sv$n))

7 Differential expression analysis with limma-voom

Now we perform limma-voom method to adjust for mean-variance relationship. We use lmFit function to calculate the linear model and eBayes function to calculate the moderated t-statistic. Finally, decideTests function will classify into upregulated, downregulated or non-significant.

v <- voom(dge.filt, mod, plot=TRUE)
Voom plot

Figure 13: Voom plot

fit <- lmFit(v, mod)
fit <- eBayes(fit)

FDRcutoff <- 0.1
res <- decideTests(fit, p.value = FDRcutoff)
summary(res)
       (Intercept) typetumor bcr_patient_barcodeTCGA-22-4609
Down             2         0                               4
NotSig         473     11866                           11861
Up           11391         0                               1
       bcr_patient_barcodeTCGA-22-5471 bcr_patient_barcodeTCGA-22-5472
Down                                 2                               1
NotSig                           11862                           11863
Up                                   2                               2
       bcr_patient_barcodeTCGA-22-5478 bcr_patient_barcodeTCGA-22-5481
Down                                 0                               6
NotSig                           11865                           11858
Up                                   1                               2
       bcr_patient_barcodeTCGA-22-5482 bcr_patient_barcodeTCGA-22-5483
Down                                11                               2
NotSig                           11852                           11863
Up                                   3                               1
       bcr_patient_barcodeTCGA-22-5489 bcr_patient_barcodeTCGA-22-5491
Down                                 3                               4
NotSig                           11860                           11860
Up                                   3                               2
       bcr_patient_barcodeTCGA-33-4587 bcr_patient_barcodeTCGA-33-6737
Down                                 6                               5
NotSig                           11859                           11858
Up                                   1                               3
       bcr_patient_barcodeTCGA-34-7107 bcr_patient_barcodeTCGA-34-8454
Down                                 1                              18
NotSig                           11864                           11836
Up                                   1                              12
       bcr_patient_barcodeTCGA-39-5040 bcr_patient_barcodeTCGA-43-3394
Down                                 4                               5
NotSig                           11860                           11856
Up                                   2                               5
       bcr_patient_barcodeTCGA-43-5670 bcr_patient_barcodeTCGA-43-6143
Down                                 1                               2
NotSig                           11863                           11862
Up                                   2                               2
       bcr_patient_barcodeTCGA-43-6647 bcr_patient_barcodeTCGA-43-6771
Down                                 9                               4
NotSig                           11855                           11860
Up                                   2                               2
       bcr_patient_barcodeTCGA-43-6773 bcr_patient_barcodeTCGA-43-7657
Down                                 8                              13
NotSig                           11832                           11848
Up                                  26                               5
       bcr_patient_barcodeTCGA-43-7658 bcr_patient_barcodeTCGA-51-4079
Down                                 8                              22
NotSig                           11857                           11834
Up                                   1                              10
       bcr_patient_barcodeTCGA-51-4080 bcr_patient_barcodeTCGA-51-4081
Down                                 2                               3
NotSig                           11862                           11861
Up                                   2                               2
       bcr_patient_barcodeTCGA-56-7222 bcr_patient_barcodeTCGA-56-7579
Down                                 2                               3
NotSig                           11855                           11860
Up                                   9                               3
       bcr_patient_barcodeTCGA-56-7580 bcr_patient_barcodeTCGA-56-7582
Down                                20                               9
NotSig                           11810                           11839
Up                                  36                              18
       bcr_patient_barcodeTCGA-56-7730 bcr_patient_barcodeTCGA-56-7731
Down                                 1                               7
NotSig                           11862                           11856
Up                                   3                               3
       bcr_patient_barcodeTCGA-56-8309 bcr_patient_barcodeTCGA-56-8623
Down                                 2                               4
NotSig                           11858                           11861
Up                                   6                               1
       bcr_patient_barcodeTCGA-58-8386 bcr_patient_barcodeTCGA-60-2709
Down                                 1                               3
NotSig                           11864                           11861
Up                                   1                               2
       bcr_patient_barcodeTCGA-77-7138 bcr_patient_barcodeTCGA-77-7142
Down                                 5                              17
NotSig                           11858                           11843
Up                                   3                               6
       bcr_patient_barcodeTCGA-77-7335 bcr_patient_barcodeTCGA-77-7337
Down                                 9                               1
NotSig                           11852                           11862
Up                                   5                               3
       bcr_patient_barcodeTCGA-77-7338 bcr_patient_barcodeTCGA-77-8007
Down                                 5                               9
NotSig                           11857                           11856
Up                                   4                               1
       bcr_patient_barcodeTCGA-77-8008 bcr_patient_barcodeTCGA-85-7710
Down                                28                              88
NotSig                           11825                           11671
Up                                  13                             107
       bcr_patient_barcodeTCGA-90-6837 bcr_patient_barcodeTCGA-90-7767
Down                                 7                              13
NotSig                           11852                           11846
Up                                   7                               7
       bcr_patient_barcodeTCGA-92-7340   SV1   SV2   SV3   SV4   SV5   SV6
Down                                13  1606  2150  2551  1744  1900  1184
NotSig                           11847  8219  7875  7238  8646  8313  9463
Up                                   6  2041  1841  2077  1476  1653  1219
         SV7   SV8   SV9  SV10  SV11
Down     747   648   895   303   384
NotSig 10396 10482 10209 11399 11300
Up       723   736   762   164   182

There is no differentially expressed gene. All genes are non-significant for the type variable, which is our variable of interest.

We can build a table to classify and sort the genes by p-value.

genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[, 1], stringsAsFactors = FALSE)

fit$genes <- genesmd

tt <- topTable(fit, coef = 2, n = Inf)
head(tt, n = 10)
         chr    symbol      logFC  AveExpr         t      P.Value adj.P.Val
3737    chr1     KCNA2  1.6261737 7.016262  2.772653 0.0084270730 0.9997256
6670    chr2       SP3 -2.2126834 6.867519 -2.642088 0.0117319514 0.9997256
7334   chr12     UBE2N -4.0600224 2.192373 -4.235944 0.0001311969 0.9997256
56999   chr3   ADAMTS9 -2.5188837 4.158329 -3.013702 0.0044805536 0.9997256
23409  chr12     SIRT4 -0.7626027 5.447291 -2.400657 0.0211469578 0.9997256
54465   chr2     ETAA1  0.9422784 6.564915  2.386613 0.0218635472 0.9997256
339967  chr4 TMPRSS11A  3.5046682 6.821845  2.505872 0.0164193440 0.9997256
5782    chr7    PTPN12  1.3602872 8.734349  2.342342 0.0242683012 0.9997256
2219    chr9      FCN1  1.6653676 5.230000  2.669790 0.0109441892 0.9997256
80335   chr3     WDR82  1.0508389 5.966129  2.363655 0.0230823491 0.9997256
               B
3737   -4.405096
6670   -4.420213
7334   -4.441569
56999  -4.442297
23409  -4.451648
54465  -4.453891
339967 -4.454001
5782   -4.456015
2219   -4.461125
80335  -4.462878

As we can see in the table, the adjusted p-values are close to 1. For this reason, they are far from being significant.

7.1 Raw p-values distribution

Under the null-hypothesis, the distribution of raw p-values must be uniform. We plot a histogram with the p-values and a QQ-plot (Figure 14).

Both plots show that the disribution is far from being uniform. This may be because some variability not explained neither by the biological factor nor by the surrogate variables. More quality assessments could be done in order to correct for this biases.

p-values and QQ-plot.

Figure 14: p-values and QQ-plot

8 Classical functional annotation

As there is no differentially expressed gene, we cannot do any classical functional enrichment, since it is necessary to have a list of differentially expressed genes.

9 Gene Set Enrichmen analysis (GSEA)

geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11866

In this case a method for pathway analysis that addresses this shortcoming by assessing differential expression directly at gene set level is used. Therfore, small but consistent changes occurring for a number of genes operating in a common pathway will be found. To perform this it is calculated for each gene set an enrichment score (ES) as function of the changes in gene expression by the genes forming the gene set.

The used gene set collection has been downloaded from GSEA. It is configured of gene sets that represent signatures of cellular pathways which are often disregulated in cancer.

GeneSetCollection
  names: GLI1_UP.V1_DN, GLI1_UP.V1_UP, ..., LEF1_UP.V1_UP (189 total)
  unique identifiers: 22818, 143384, ..., 79649 (11250 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: NullCollection (1 total)
[1] 189
[1] "GLI1_UP.V1_DN" "GLI1_UP.V1_UP" "E2F1_UP.V1_DN" "E2F1_UP.V1_UP"
[5] "EGFR_UP.V1_DN" "EGFR_UP.V1_UP"

There are 189 gene sets in this gene sets collection.

First we map the identifiers from the gene sets to the identifiers of the data we are going to analyze.

gsc <- mapIdentifiers(entrezOncogens, AnnoOrEntrezIdentifier(metadata(se.filt)$annotation))
gsc
GeneSetCollection
  names: GLI1_UP.V1_DN, GLI1_UP.V1_UP, ..., LEF1_UP.V1_UP (189 total)
  unique identifiers: 22818, 143384, ..., 79649 (11250 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: NullCollection (1 total)

in this case, nothing has happend, we could jump this step because data is already anchorated to Entrez identifiers. Now, we start with an incidence matrix indicating what genes belong to what gene set.

Im <- incidence(gsc)
dim(Im)
[1]   189 11250
Im[1:2, 1:10]
              22818 143384 140711 57583 81669 54432 79712 23596 91543 6580
GLI1_UP.V1_DN     1      1      1     1     1     1     1     1     1    1
GLI1_UP.V1_UP     0      0      0     0     0     0     0     0     0    0

The incidence matrix is a matrix in which the rows represent the gene sets, the columns represent the gene in entrez identifier, and the data is 1 or 0, depending whether the gene is in the gene set or it is not.

Next, we discard genes (columns in the incidence matrix) that do not form part of our data.

Im <- Im[, colnames(Im) %in% rownames(se.filt)]
dim(Im)
[1]  189 6067

From 11250 of the genes in the gene sets, only 6067 are in our experiment. The rest may have been filtered during the quality assessment process.

As not all the genes in our experiment are present in the gene sets, we discard them.

se.filt <- se.filt[colnames(Im), ]
dim(se.filt)
[1] 6067   94
dge.filt <- dge.filt[colnames(Im), ]
dim(dge.filt)
[1] 6067   94

Now we perform limma pipeline combined with SVA to create a table of genes and their t-statistic to do the GSEA. The pipeline is the same as in differential expression analysis, but now with less genes (we have filtered them)

library(limma)
library(sva)
mod <- model.matrix(~se.filt$type + bcr_patient_barcode, data = colData(se.filt))
mod0 <- model.matrix(~1, data = colData(se.filt))
sv <- sva(assays(se.filt)$logCPM, mod = mod, mod0 = mod0)
Number of significant surrogate variables is:  11 
Iteration (out of 5 ):1  2  3  4  5  
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:48], paste0("SV", 1:sv$n))
v <- voom(dge.filt, mod)
fit <- lmFit(assays(se.filt)$logCPM, mod)
fit <- eBayes(fit, trend = TRUE)
tt <- topTable(fit, coef = 2, n = Inf)

Now we can calculate the z-score, that gives us more robustness about analyzing the moderated t-statistic. We select only those gene sets with more than 5 genes and sort the z-score to see the most significant gene sets.

Im <- Im[rowSums(Im) >= 5, ]
dim(Im)
[1]  189 6067
tGSgenes <- tt[match(colnames(Im), rownames(tt)), "t"]
length(tGSgenes)
[1] 6067
head(tGSgenes)
[1] -0.1480484 -0.6165308 -0.3180018  0.9846855 -1.3489488  0.2229815
zS <- sqrt(rowSums(Im)) * (as.vector(Im %*% tGSgenes)/rowSums(Im))
length(zS)
[1] 189
head(zS)
GLI1_UP.V1_DN GLI1_UP.V1_UP E2F1_UP.V1_DN E2F1_UP.V1_UP EGFR_UP.V1_DN 
    0.0813339     0.5123863     1.7608323     0.7978159     2.3626425 
EGFR_UP.V1_UP 
    1.1886709 
rnkGS <- sort(abs(zS), decreasing = TRUE)
head(rnkGS)
         ERB2_UP.V1_DN           MEK_UP.V1_DN CAHOY_OLIGODENDROCUTIC 
              3.040329               2.712168               2.507218 
          WNT_UP.V1_DN      KRAS.600_UP.V1_UP          EGFR_UP.V1_DN 
              2.397208               2.376572               2.362642 

9.1 QQ plot

A QQ plot (Figure 15) may show visually how the z-scores are distributed.

QQ-plot.

Figure 15: QQ-plot

This QQ plot shows that the z-scores do not follow a normal distribution. Under the null hypothesis, those z-scores representing gene sets that are not enriched follow a normal distribution, but this does not happens. Maybe because overlapping gene sets or because as this gene set is specific for cancer and we are analyzing cancer cells, the pathways are all significant.

Let’s adjust for the p-value to fins differential expressed gene sets.

pv <- pmin(pnorm(zS), 1 - pnorm(zS))
pvadj <- p.adjust(pv, method = "fdr")
DEgs <- names(pvadj)[which(pvadj < 0.01)]
length(DEgs)
[1] 0

There is no gene set enriched.

10 Gene Set Variation Analysis (GSVA)

GSVA is another way to analyze the gene sets without a list of differentially expressed genes. This has some properties that can be useful when differential expression analysis and GSEA do not work.

library(GSVA)
GSexpr <- gsva(assays(se.filt)$logCPM, gsc, min.sz=5, max.sz=300, verbose=FALSE)
dim(GSexpr)
[1] 189  94

We create an expression data matrix, in which the rows are gene sets, the columns are samples and the data is the ES. We have 189 gene sets and 94 samples. We perform SVA with this matrix and perform a differential expression pipeline, but instead of doing it with genes, we do it with gene sets.

mod <- model.matrix(~se.filt$type + bcr_patient_barcode, data = colData(se.filt))
mod0 <- model.matrix(~1, data = colData(se.filt))
svaobj <- sva(GSexpr, mod, mod0)
Number of significant surrogate variables is:  7 
Iteration (out of 5 ):1  2  3  4  5  
modSVs <- cbind(mod, svaobj$sv)

fit <- lmFit(GSexpr, modSVs)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)
DEgs <- rownames(tt[tt$adj.P.Val < 0.01, , drop = FALSE])
length(DEgs)
[1] 105

At the end, we have 105 gene sets differentially expressed. The most significant ones can be seen at the top table tt.

head(tt)
                        logFC     AveExpr         t      P.Value
JNK_DN.V1_DN       -0.2591644  0.02798997 -14.64205 1.946661e-19
HINATA_NFKB_MATRIX -0.6320939  0.05210125 -13.05664 1.680177e-17
ATF2_S_UP.V1_DN    -0.2084399  0.02713102 -12.67754 5.110700e-17
BCAT_GDS748_DN     -0.4665696  0.04177321 -12.24179 1.877812e-16
ALK_DN.V1_DN        0.2372378 -0.01974942  11.72500 9.071760e-16
RELA_DN.V1_DN      -0.1993592  0.01855533 -10.84520 1.434217e-14
                      adj.P.Val        B
JNK_DN.V1_DN       3.679190e-17 34.05338
HINATA_NFKB_MATRIX 1.587768e-15 29.65628
ATF2_S_UP.V1_DN    3.219741e-15 28.55640
BCAT_GDS748_DN     8.872660e-15 27.26864
ALK_DN.V1_DN       3.429125e-14 25.70860
RELA_DN.V1_DN      4.517782e-13 22.97125

Although the log fold-change is not large, they are very significant.

10.1 Volcano plot of GSVA

A volcano plot can help us to visualize the results. In figure 16 It is seen to be a lot of gene sets differentially expressed as well.

Volcano plot.

Figure 16: Volcano plot

11 Differential expression analysis with limma-voom

In this cas there is no surogate variables analysis. Limma-voom method is performed to adjust for mean-variance relationship. We use lmFit function to calculate the linear model and eBayes function to calculate the moderated t-statistic. Finally, decideTests function will classify into upregulated, downregulated or non-significant.
Voom plot

(#fig:voom2, )Voom plot

       (Intercept) typetumor bcr_patient_barcodeTCGA-22-4609
Down            19      4402                               0
NotSig         550      2147                           11865
Up           11297      5317                               1
       bcr_patient_barcodeTCGA-22-5471 bcr_patient_barcodeTCGA-22-5472
Down                                 1                               1
NotSig                           11860                           11863
Up                                   5                               2
       bcr_patient_barcodeTCGA-22-5478 bcr_patient_barcodeTCGA-22-5481
Down                                 1                               8
NotSig                           11864                           11854
Up                                   1                               4
       bcr_patient_barcodeTCGA-22-5482 bcr_patient_barcodeTCGA-22-5483
Down                                 1                               1
NotSig                           11864                           11864
Up                                   1                               1
       bcr_patient_barcodeTCGA-22-5489 bcr_patient_barcodeTCGA-22-5491
Down                                 0                               1
NotSig                           11866                           11863
Up                                   0                               2
       bcr_patient_barcodeTCGA-33-4587 bcr_patient_barcodeTCGA-33-6737
Down                                49                               2
NotSig                           11727                           11859
Up                                  90                               5
       bcr_patient_barcodeTCGA-34-7107 bcr_patient_barcodeTCGA-34-8454
Down                                 0                              10
NotSig                           11865                           11853
Up                                   1                               3
       bcr_patient_barcodeTCGA-39-5040 bcr_patient_barcodeTCGA-43-3394
Down                                38                               1
NotSig                           11757                           11863
Up                                  71                               2
       bcr_patient_barcodeTCGA-43-5670 bcr_patient_barcodeTCGA-43-6143
Down                                 3                               0
NotSig                           11859                           11865
Up                                   4                               1
       bcr_patient_barcodeTCGA-43-6647 bcr_patient_barcodeTCGA-43-6771
Down                                 6                               0
NotSig                           11859                           11864
Up                                   1                               2
       bcr_patient_barcodeTCGA-43-6773 bcr_patient_barcodeTCGA-43-7657
Down                                 4                              10
NotSig                           11851                           11851
Up                                  11                               5
       bcr_patient_barcodeTCGA-43-7658 bcr_patient_barcodeTCGA-51-4079
Down                                53                               5
NotSig                           11736                           11858
Up                                  77                               3
       bcr_patient_barcodeTCGA-51-4080 bcr_patient_barcodeTCGA-51-4081
Down                                 1                               3
NotSig                           11864                           11853
Up                                   1                              10
       bcr_patient_barcodeTCGA-56-7222 bcr_patient_barcodeTCGA-56-7579
Down                                 1                               0
NotSig                           11859                           11863
Up                                   6                               3
       bcr_patient_barcodeTCGA-56-7580 bcr_patient_barcodeTCGA-56-7582
Down                                 3                               3
NotSig                           11862                           11860
Up                                   1                               3
       bcr_patient_barcodeTCGA-56-7730 bcr_patient_barcodeTCGA-56-7731
Down                                24                               7
NotSig                           11817                           11856
Up                                  25                               3
       bcr_patient_barcodeTCGA-56-8309 bcr_patient_barcodeTCGA-56-8623
Down                                 4                             349
NotSig                           11853                           11191
Up                                   9                             326
       bcr_patient_barcodeTCGA-58-8386 bcr_patient_barcodeTCGA-60-2709
Down                                 0                               1
NotSig                           11864                           11864
Up                                   2                               1
       bcr_patient_barcodeTCGA-77-7138 bcr_patient_barcodeTCGA-77-7142
Down                                 7                             126
NotSig                           11857                           11701
Up                                   2                              39
       bcr_patient_barcodeTCGA-77-7335 bcr_patient_barcodeTCGA-77-7337
Down                                74                               7
NotSig                           11756                           11854
Up                                  36                               5
       bcr_patient_barcodeTCGA-77-7338 bcr_patient_barcodeTCGA-77-8007
Down                                 2                               5
NotSig                           11861                           11860
Up                                   3                               1
       bcr_patient_barcodeTCGA-77-8008 bcr_patient_barcodeTCGA-85-7710
Down                                 6                              30
NotSig                           11858                           11787
Up                                   2                              49
       bcr_patient_barcodeTCGA-90-6837 bcr_patient_barcodeTCGA-90-7767
Down                                 0                               7
NotSig                           11866                           11858
Up                                   0                               1
       bcr_patient_barcodeTCGA-92-7340
Down                                10
NotSig                           11850
Up                                   6

Table of results:

genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[, 1], stringsAsFactors = FALSE)
fit$genes <- genesmd
tt <- topTable(fit, coef = 2, n = Inf)
head(tt, n = 10)
         chr   symbol    logFC  AveExpr        t      P.Value    adj.P.Val
692205  chr2  SNORD89 4.558016 4.061933 29.03018 5.211431e-33 6.183884e-29
9263    chr7   STK17A 5.510503 7.131354 28.35326 1.590799e-32 9.438213e-29
171482  chrX    FAM9A 6.615355 1.406490 28.08485 2.492491e-32 9.858632e-29
121391 chr12    KRT74 3.723965 3.686091 26.15339 7.085033e-31 2.019640e-27
6100    chr7      RP9 3.665029 4.745683 26.05106 8.510198e-31 2.019640e-27
94134  chr10 ARHGAP12 3.110953 3.727910 25.90667 1.103364e-30 2.182085e-27
27316   chrX     RBMX 4.123045 3.581820 25.37018 2.927936e-30 4.963270e-27
81556  chr15     VWA9 4.322738 2.008536 25.09129 4.897154e-30 7.263703e-27
10523  chr19    CHERP 2.953520 3.510743 24.88294 7.214682e-30 9.512158e-27
93129  chr16    ORAI3 4.155809 3.147168 24.79035 8.577857e-30 1.017848e-26
              B
692205 65.01160
9263   64.02774
171482 63.31797
121391 60.18915
6100   60.05726
94134  59.76904
27316  58.77257
81556  58.16998
10523  57.90230
93129  57.68415

As can be seen looking at the p-values, there are significant differential expressed genes.

DE genes:

DEgenes <- rownames(tt)[tt$adj.P.Val < FDRcutoff]
length(DEgenes)
[1] 9719
saveRDS(DEgenes, file.path("results", "DEgenes.rds"))

There are 9719.

11.1 MA plot with the top 10

Top 10 MA plot

Figure 17: Top 10 MA plot

In the volcano plot, Figure 17, is seen the top 10 genes, all of them with a large magnitude fold-change and high statistical significance.

11.2 Plot p-values

Both plots in Figure 18 show that the disribution is far from being uniform. This may be because some variability not explained by the biological factor. Moreover, surrogate variables are not analysed in this case. More quality assessments could be done in order to correct for this biases.

p-values

Figure 18: p-values

12 Gene Ontology analysis

We will start with a Gene Onthology analysis of the DE genes. The first step is safe the identifiers of the of the original set of gene profiled. Then we will follow the main steps of the GO analysis. This steps are: 1) build a parameter object with information specifying the gene universe, the set of DE genes, the annotation package to use, and so one, 2) run the functional enrichment analysis, and 3) store and visualize the results.

geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11866
library(GOstats)
params <- new("GOHyperGParams", geneIds=DEgenes, universeGeneIds=geneUniverse,
            annotation="org.Hs.eg.db", ontology="BP",
            pvalueCutoff=0.05, testDirection="over")

Since a problem in a GO analysis is that the hierarchy of GO terms and their overlap render highly dependent tests. A conditional test is directly used. Then, if parent and child GO term contain the same significant genes, the child node will be retrieved because it is more specific. The no-conditional test is also done to allow the comparison between the two test results.

hgOver <- hyperGTest(params)
hgOver
Gene to GO BP  test for over-representation 
13342 GO BP ids tested (136 have p < 0.05)
Selected gene set size: 7794 
    Gene universe size: 9460 
    Annotation package: org.Hs.eg 
conditional(params) <- TRUE
hgOverCond <- hyperGTest(params)
hgOverCond
Gene to GO BP Conditional test for over-representation 
13342 GO BP ids tested (85 have p < 0.05)
Selected gene set size: 7794 
    Gene universe size: 9460 
    Annotation package: org.Hs.eg 

As spected we have less significant GO terms when conditional test is used than when we perform a non-conditional test. The number of significant GO terms have chenged from 136 to 85.

This is the data.frame object with the results:

goresults <- summary(hgOverCond)
head(goresults)
      GOBPID      Pvalue OddsRatio  ExpCount Count Size
1 GO:0045844 0.001653831       Inf  27.18837    33   33
2 GO:0051153 0.004324410       Inf  23.06004    28   28
3 GO:0055013 0.005309072       Inf  22.24503    27   27
4 GO:0043038 0.007830366       Inf  20.59725    25   25
5 GO:0010501 0.008055666  7.510633  29.66004    35   36
6 GO:0007507 0.008726619  1.588209 205.87924   220  250
                                                       Term
1 positive regulation of striated muscle tissue development
2        regulation of striated muscle cell differentiation
3                           cardiac muscle cell development
4                                     amino acid activation
5                         RNA secondary structure unwinding
6                                         heart development

GO terms involving a few genes (e.g., < 3) in their size (m) and in their enrichment by DE genes (k) are likely to be less reliable than those that involve many genes. Likewise, large GO terms may provide little insight. To try to spot the more interesting and reliable GO terms we can filter the previous results by a minimum value on the Count and Size columns, a maximum Count value, and order them by the OddsRatio column:

goresults <- goresults[goresults$Size >= 3 & goresults$Size <= 300 & goresults$Count >= 3, ]
goresults <- goresults[order(goresults$OddsRatio, decreasing=TRUE), ]
head(goresults)
       GOBPID      Pvalue OddsRatio ExpCount Count Size
1  GO:0045844 0.001653831       Inf 27.18837    33   33
2  GO:0051153 0.004324410       Inf 23.06004    28   28
3  GO:0055013 0.005309072       Inf 22.24503    27   27
4  GO:0043038 0.007830366       Inf 20.59725    25   25
9  GO:0006418 0.011547980       Inf 18.94947    23   23
12 GO:0010591 0.014023396       Inf 18.12558    22   22
                                                        Term
1  positive regulation of striated muscle tissue development
2         regulation of striated muscle cell differentiation
3                            cardiac muscle cell development
4                                      amino acid activation
9                tRNA aminoacylation for protein translation
12                      regulation of lamellipodium assembly

13 Gene Set Enrichment Analysis (GSEA)

In this case a method for pathway analysis that addresses this shortcoming by assessing DE directly at gene set level is used. Therfore, small but consistent changes occurring for a number of genes operating in a common pathway will be found. To perform this it is calculated for each gene set an enrichment score (ES) as function of the changes in gene expression by the genes forming the gene set.

The used gene set collection has been downloaded from GSEA. It is configured of gene sets that represent signatures of cellular pathways which are often disregulated in cancer.

GeneSetCollection
  names: GLI1_UP.V1_DN, GLI1_UP.V1_UP, ..., LEF1_UP.V1_UP (189 total)
  unique identifiers: 22818, 143384, ..., 79649 (11250 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: NullCollection (1 total)
[1] 189
[1] "GLI1_UP.V1_DN" "GLI1_UP.V1_UP" "E2F1_UP.V1_DN" "E2F1_UP.V1_UP"
[5] "EGFR_UP.V1_DN" "EGFR_UP.V1_UP"

First we need to map the identifiers from the gene sets to the identifiers of the data we are going to analyze:

gsc <- mapIdentifiers(entrezOncogens, AnnoOrEntrezIdentifier(metadata(se.filt)$annotation))
gsc
GeneSetCollection
  names: GLI1_UP.V1_DN, GLI1_UP.V1_UP, ..., LEF1_UP.V1_UP (189 total)
  unique identifiers: 22818, 143384, ..., 79649 (11250 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: NullCollection (1 total)

Nothing has happend, we can jump this step because data is already anchorated to Entrez ientifiers. Then, we have to start with an incidence matrix indicating what genes belong to what gene set:

Im <- incidence(gsc)
dim(Im)
[1]   189 11250
Im[1:2, 1:10]
              22818 143384 140711 57583 81669 54432 79712 23596 91543 6580
GLI1_UP.V1_DN     1      1      1     1     1     1     1     1     1    1
GLI1_UP.V1_UP     0      0      0     0     0     0     0     0     0    0

Next, we discard genes (columns in Im) that do not form part of our data:

Im <- Im[, colnames(Im) %in% rownames(se.filt)]
dim(Im)
[1]  189 6067

Likewise, not all genes in our data are annotated to gene sets, so we also discard them:

se.filt <- se.filt[colnames(Im), ]
dim(se.filt)
[1] 6067   94
dge.filt <- dge.filt[colnames(Im), ]
dim(dge.filt)
[1] 6067   94
Voom plot

Figure 19: Voom plot

[1]  189 6067
[1] 6067
[1]  5.547354  7.512204  5.549996 -8.950190 -7.892440  7.462655
[1] 189
GLI1_UP.V1_DN GLI1_UP.V1_UP E2F1_UP.V1_DN E2F1_UP.V1_UP EGFR_UP.V1_DN 
     2.918573     -2.710980     -6.409755     -3.131997      1.162975 
EGFR_UP.V1_UP 
    -2.551358 
   JNK_DN.V1_DN ATF2_S_UP.V1_DN   SNF5_DN.V1_UP    ALK_DN.V1_DN 
       22.04500        20.28736        19.57452        18.89280 
 BCAT_GDS748_DN   KRAS.DF.V1_DN 
       18.48136        17.94397 

14 Gene Set Variation Analysis (GSVA)

library(GSVA)
GSexpr <- gsva(assays(se.filt)$logCPM, gsc, min.sz=5, max.sz=300, verbose=FALSE)
class(GSexpr)
[1] "matrix"
dim(GSexpr)
[1] 189  94
mod <- model.matrix(~se.filt$type + bcr_patient_barcode, data = colData(se.filt))

fit <- lmFit(GSexpr, mod)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)
DEgs <- rownames(tt[tt$adj.P.Val < 0.01, , drop = FALSE])
length(DEgs)
[1] 108

14.1 GSVA volcano plot

GSVA volcano plot

Figure 20: GSVA volcano plot

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=ca_ES.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=ca_ES.UTF-8        LC_COLLATE=ca_ES.UTF-8    
 [5] LC_MONETARY=ca_ES.UTF-8    LC_MESSAGES=ca_ES.UTF-8   
 [7] LC_PAPER=ca_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    methods   stats     graphics  grDevices utils    
[8] datasets  base     

other attached packages:
 [1] GO.db_3.5.0                GOstats_2.44.0            
 [3] Category_2.44.0            Matrix_1.2-14             
 [5] GSVA_1.26.0                org.Hs.eg.db_3.5.0        
 [7] GSEABase_1.40.1            graph_1.56.0              
 [9] sva_3.26.0                 BiocParallel_1.12.0       
[11] genefilter_1.60.0          mgcv_1.8-23               
[13] nlme_3.1-137               geneplotter_1.56.0        
[15] annotate_1.56.2            XML_3.98-1.11             
[17] AnnotationDbi_1.40.0       lattice_0.20-35           
[19] edgeR_3.20.9               limma_3.34.9              
[21] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
[23] matrixStats_0.53.1         Biobase_2.38.0            
[25] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
[27] IRanges_2.12.0             S4Vectors_0.16.0          
[29] BiocGenerics_0.24.0        knitr_1.20                
[31] BiocStyle_2.6.1           

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17           locfit_1.5-9.1         rprojroot_1.3-2       
 [4] digest_0.6.15          mime_0.5               R6_2.2.2              
 [7] backports_1.1.2        RSQLite_2.1.0          evaluate_0.10.1       
[10] highr_0.6              zlibbioc_1.24.0        Rgraphviz_2.22.0      
[13] blob_1.1.1             rmarkdown_1.9          shinythemes_1.1.1     
[16] splines_3.4.4          stringr_1.3.1          RCurl_1.95-4.10       
[19] bit_1.1-14             shiny_1.1.0            compiler_3.4.4        
[22] httpuv_1.4.3           xfun_0.1               pkgconfig_2.0.1       
[25] htmltools_0.3.6        GenomeInfoDbData_1.0.0 bookdown_0.7          
[28] codetools_0.2-15       AnnotationForge_1.20.0 later_0.7.2           
[31] bitops_1.0-6           grid_3.4.4             RBGL_1.54.0           
[34] xtable_1.8-2           DBI_1.0.0              magrittr_1.5          
[37] stringi_1.2.2          XVector_0.18.0         promises_1.0.1        
[40] RColorBrewer_1.1-2     tools_3.4.4            bit64_0.9-7           
[43] survival_2.42-3        yaml_2.1.19            memoise_1.1.0